Clustering analysis is one of the critical tasks in machine learning. Traditionally, clustering has been treated as an independent task, separate from outlier detection. Because outliers can substantially erode clustering performance, a small number of algorithms attempt to incorporate outlier detection into the clustering process. However, most of these algorithms are built on unsupervised partition-based methods such as K-means, and by their nature they often fail to handle clusters with complex, non-convex shapes. To address this challenge, we propose SSDBCODI, a semi-supervised density-based algorithm. SSDBCODI combines the strength of density-based algorithms, which can handle clusters of complex shapes, with a semi-supervised element that offers the flexibility to adjust clustering results according to a few user labels. We also integrate an outlier detection component into the clustering process. Potential outliers are detected based on three scores generated during this process: (1) a reachability score, which measures how density-reachable a point is from the labeled normal objects; (2) a local density score, which measures the density of neighboring data objects; and (3) a similarity score, which measures a point's closeness to its nearest labeled outliers. In the next step, an instance weight is generated for each data instance based on these three scores before being used to train a classifier for further clustering and outlier detection. For evaluation, we have run the proposed algorithm against several state-of-the-art approaches on multiple datasets and report the outlier detection results separately from the clustering results. Our results indicate that our algorithm can achieve superior results with only a small number of labels.
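To make the three scores and the instance weighting concrete, the following minimal sketch is our own illustration, not the authors' implementation: the function names, the k-nearest-neighbor density estimate, and the equal-weight combination are all assumptions.

```python
# Hypothetical sketch of the three outlier scores and instance weights described
# above; names and the combination rule are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def outlier_scores(X, normal_idx, outlier_idx, k=10):
    """X: (n, d) data; normal_idx / outlier_idx: indices of labeled points."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                       # column 0 is the point itself
    local_density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)

    # Reachability proxy: inverse distance to the closest labeled normal point.
    d_normal = np.linalg.norm(X[:, None, :] - X[normal_idx][None, :, :], axis=2).min(axis=1)
    reachability = 1.0 / (d_normal + 1e-12)

    # Similarity: closeness to the nearest labeled outlier.
    d_outlier = np.linalg.norm(X[:, None, :] - X[outlier_idx][None, :, :], axis=2).min(axis=1)
    similarity = 1.0 / (d_outlier + 1e-12)
    return local_density, reachability, similarity

def instance_weights(local_density, reachability, similarity):
    """Combine the three scores into a weight in [0, 1]; higher = more likely normal."""
    def norm(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return (norm(local_density) + norm(reachability) + (1.0 - norm(similarity))) / 3.0

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
ld, rs, ss = outlier_scores(X, normal_idx=[0, 1, 2], outlier_idx=[3, 4])
print(instance_weights(ld, rs, ss)[:5])
```

Under this toy weighting, a point with high local density, high reachability from labeled normals, and low similarity to labeled outliers receives a weight near 1.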
Due to the limited diversity of datasets, the generalization ability of pose estimators is poor. To address this problem, we propose a pose augmentation solution driven by a DH forward kinematics model, which we call DH-AUG. We observe that previous work is all based on single-frame pose augmentation; if applied directly to a video pose estimator, several previously overlooked problems arise: (i) angle ambiguity in bone rotation (multiple solutions); (ii) the generated skeleton videos lack motion continuity. To solve these problems, we propose a special generator based on the DH forward kinematics model, called the DH-generator. Extensive experiments show that DH-AUG can greatly improve the generalization ability of video pose estimators. In addition, when applied to a single-frame 3D pose estimator, our method outperforms the previous best pose augmentation method. The source code has been released at https://github.com/hlz0606/DH-AUG-DH-Forward-Kinematics-Model-Driven-Augmentation-for-3D-Human-Pose-Estimation.
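For readers unfamiliar with the Denavit-Hartenberg (DH) convention that DH-AUG builds on, here is a small self-contained sketch of DH forward kinematics under the standard convention; the toy link parameters are made up and unrelated to the paper's skeleton model.

```python
# Minimal illustration of DH forward kinematics (standard DH convention);
# the parameter values below are invented for demonstration only.
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform from link i-1 to link i under the standard DH convention."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(dh_params):
    """dh_params: list of (theta, d, a, alpha); returns the 3D position of every joint."""
    T = np.eye(4)
    joints = [T[:3, 3].copy()]
    for theta, d, a, alpha in dh_params:
        T = T @ dh_transform(theta, d, a, alpha)
        joints.append(T[:3, 3].copy())
    return np.stack(joints)

# Example: a toy 3-link planar chain (illustrative values only).
print(forward_kinematics([(0.3, 0.0, 0.5, 0.0),
                          (0.6, 0.0, 0.4, 0.0),
                          (0.2, 0.0, 0.3, 0.0)]))
```

Because joint angles enter only through such transforms, sampling angles and re-running forward kinematics yields new, kinematically valid skeletons, which is the premise of generator-based pose augmentation.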
Face forgery detection methods based on convolutional neural networks achieve remarkable results during training but struggle to maintain comparable performance at test time. We observe that detectors tend to focus on content information rather than artifact traces, which suggests that they are sensitive to the intrinsic bias of the dataset and leads to severe overfitting. Motivated by this key observation, we design an easily embeddable disentanglement framework to remove content information, and further propose a Content Consistency Constraint (C2C) and a Global Representation Contrastive Constraint (GRCC) to enhance the independence of the disentangled features. In addition, we construct two unbalanced datasets to investigate the impact of content bias. Extensive visualizations and experiments demonstrate that our framework not only ignores the interference of content information but also guides the detector to mine suspicious artifact traces and achieve competitive performance.
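The abstract does not give the exact form of C2C or GRCC. Purely as a generic illustration of a representation contrastive constraint over paired global representations, an InfoNCE-style loss could look like the following; all names and the pairing scheme are our assumptions.

```python
# Generic illustration only: a standard InfoNCE-style contrastive loss that pulls
# paired representations together and pushes all other pairs apart.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """z_a, z_b: (batch, dim) paired global representations; row i of z_a matches row i of z_b."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))         # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
print(float(loss))
```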
With the emergence of GANs, face forgery technologies have been severely abused, making accurate forgery detection urgent. Inspired by the fact that the PPG signal corresponds to the periodic change of skin color caused by the heartbeat in face videos, we observe that although PPG signals are inevitably degraded during the forgery process, a mixture of PPG signals still remains in forged videos, and this mixture exhibits a distinctive rhythmic pattern that depends on the generation method. Motivated by this key observation, we propose a framework for face forgery detection and categorization, consisting of: 1) a Spatial-Temporal Filtering Network (STFNet) for PPG signal filtering, and 2) a Spatial-Temporal Interaction Network (STINet) for constraining and modeling the interaction of PPG signals. Moreover, with insight into how forgery methods are generated, we further propose intra-source and inter-source blending to boost the performance of the framework. Overall, extensive experiments demonstrate the superiority of our method.
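As background on the physiological signal the method relies on, and not the paper's STFNet/STINet, a crude rPPG trace can be obtained by averaging skin color over a cropped face region and band-pass filtering around heart-rate frequencies. The sketch below is illustrative; the input is a made-up array standing in for a face video.

```python
# Crude rPPG extraction for illustration: mean green-channel intensity of a face crop
# over time, band-pass filtered to plausible heart-rate frequencies (0.7-4 Hz).
import numpy as np
from scipy.signal import butter, filtfilt

def raw_rppg(face_frames, fps=30.0, low_hz=0.7, high_hz=4.0):
    """face_frames: (T, H, W, 3) array of cropped face frames."""
    green = face_frames[..., 1].reshape(face_frames.shape[0], -1).mean(axis=1)
    green = green - green.mean()                        # remove the DC component
    b, a = butter(3, [low_hz, high_hz], btype="bandpass", fs=fps)
    return filtfilt(b, a, green)                        # periodic trace driven by heartbeat

frames = np.random.rand(300, 64, 64, 3)                 # toy stand-in for a face video
print(raw_rppg(frames).shape)                           # (300,)
```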
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
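The abstract does not spell out the exact modulation used by the style-aware adaptive transformer. A common way to let a style code adjust feed-forward computation is FiLM-style scaling and shifting of the hidden activations, sketched below with illustrative dimensions; this is our assumption, not the released code.

```python
# Hypothetical sketch: a transformer feed-forward block whose hidden activations are
# modulated by a style code, in the spirit of the style-aware adaptation described above.
import torch
import torch.nn as nn

class StyleAdaptiveFeedForward(nn.Module):
    def __init__(self, d_model=256, d_hidden=1024, d_style=128):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Map the style code to a per-channel scale and shift for the hidden layer.
        self.to_scale_shift = nn.Linear(d_style, 2 * d_hidden)

    def forward(self, x, style_code):
        # x: (batch, seq_len, d_model); style_code: (batch, d_style)
        scale, shift = self.to_scale_shift(style_code).chunk(2, dim=-1)
        h = torch.relu(self.fc1(x))
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)   # style-conditioned modulation
        return self.fc2(h)

ff = StyleAdaptiveFeedForward()
out = ff(torch.randn(2, 50, 256), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 50, 256])
```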
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
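As a rough, hypothetical illustration of siamese sampling and soft boundary labels, and not the SSRN implementation, one could pair a uniform sparse sampling with a shifted sibling sampling and place a Gaussian soft label over the sampled indices around each annotated boundary; all names and values below are our assumptions.

```python
# Illustrative sketch: uniform sparse sampling plus a shifted "siamese" sampling to
# recover frames near the annotated boundaries, and a soft boundary label over the
# sampled indices.
import numpy as np

def sparse_sample(num_frames, num_samples, offset=0.0):
    idx = np.linspace(0, num_frames - 1, num_samples) + offset
    return np.clip(np.round(idx).astype(int), 0, num_frames - 1)

def soft_boundary_labels(sampled_idx, boundary_frame, sigma=4.0):
    """Gaussian soft label over sampled frames, centered on a ground-truth boundary."""
    w = np.exp(-((sampled_idx - boundary_frame) ** 2) / (2 * sigma ** 2))
    return w / (w.sum() + 1e-12)

num_frames, num_samples = 300, 32
base = sparse_sample(num_frames, num_samples)                       # original sparse view
sibling = sparse_sample(num_frames, num_samples,
                        offset=num_frames / (2 * num_samples))      # shifted sibling view
start_labels = soft_boundary_labels(base, boundary_frame=87)
print(base[:5], sibling[:5], start_labels.argmax())
```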
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL) yet has been criticized for learning inefficiency. We believe the insufficient utilization of training signals is responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch under a disjoint regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual-branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooted in orthogonal perspectives on training-efficiency improvement, DM and JD cooperatively accelerate training convergence without sacrificing model generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) and still report competitive performance. With JD, our DMJD clearly improves the linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks like semantic segmentation and object detection, our DMJD also presents superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
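One simple way to realize a disjoint regulation, offered here as our reading of the idea rather than the released DMJD code, is to make the visible token sets of the views pairwise disjoint while each view keeps the target masking rate, so that every token becomes a reconstruction target in some view.

```python
# Rough sketch of disjoint masked views: visible token sets are pairwise disjoint,
# each view keeps the same masking rate.
import numpy as np

def disjoint_masked_views(num_tokens=196, mask_ratio=0.75):
    visible_per_view = int(round(num_tokens * (1 - mask_ratio)))
    num_views = num_tokens // visible_per_view            # e.g. 4 views at a 0.75 ratio
    perm = np.random.permutation(num_tokens)
    masks = []
    for v in range(num_views):
        visible = perm[v * visible_per_view:(v + 1) * visible_per_view]
        mask = np.ones(num_tokens, dtype=bool)            # True = masked / to reconstruct
        mask[visible] = False
        masks.append(mask)
    return masks

views = disjoint_masked_views()
# Each token is visible in at most one view, hence a reconstruction target in the others.
assert np.all(np.sum(~np.stack(views), axis=0) <= 1)
print([int(m.sum()) for m in views])   # e.g. [147, 147, 147, 147]
```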
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology. However, existing methods usually incur a large computational cost. Meanwhile, the activation function causes some features of the intermediate layers to be lost. Therefore, it is a challenge to make the model lightweight while reducing the impact of intermediate feature loss on the reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate the above problem. Specifically, FIWHN consists of a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as the backbone, where every three WDIBs form a Feature shuffle Weighted Group (FSWG) through mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduce well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in WDIB, and effectively cross-fuse features of different finenesses through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features lacking in the CNN model, we introduce the Transformer into our model and explore a new way of combining the CNN and Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that our proposed FIWHN achieves a good balance between performance and efficiency, and is more conducive to downstream tasks solving problems in low-pixel scenarios.
Rigorous guarantees about the performance of predictive algorithms are necessary in order to ensure their responsible use. Previous work has largely focused on bounding the expected loss of a predictor, but this is not sufficient in many risk-sensitive applications where the distribution of errors is important. In this work, we propose a flexible framework to produce a family of bounds on quantiles of the loss distribution incurred by a predictor. Our method takes advantage of the order statistics of the observed loss values rather than relying on the sample mean alone. We show that a quantile is an informative way of quantifying predictive performance, and that our framework applies to a variety of quantile-based metrics, each targeting important subsets of the data distribution. We analyze the theoretical properties of our proposed method and demonstrate its ability to rigorously control loss quantiles on several real-world datasets.
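A classical instance of this idea, shown only as background rather than the paper's exact procedure, is the distribution-free upper confidence bound on a loss quantile obtained from order statistics and the binomial distribution: with n i.i.d. held-out losses, the k-th smallest loss upper-bounds the q-quantile with probability at least the Binomial(n, q) CDF at k-1.

```python
# Distribution-free upper confidence bound on the q-quantile of the loss from order
# statistics. Names and the toy data are ours; the paper's framework is more general.
import numpy as np
from scipy.stats import binom

def quantile_upper_bound(losses, q=0.9, delta=0.05):
    """Smallest order statistic that upper-bounds the q-quantile with prob. >= 1 - delta."""
    losses = np.sort(np.asarray(losses))
    n = len(losses)
    # Find the smallest k with P(Binomial(n, q) <= k - 1) >= 1 - delta.
    for k in range(1, n + 1):
        if binom.cdf(k - 1, n, q) >= 1 - delta:
            return losses[k - 1]
    return np.inf   # not enough samples to certify the bound at this confidence

rng = np.random.default_rng(0)
held_out_losses = rng.exponential(scale=1.0, size=1000)   # toy loss values
print(quantile_upper_bound(held_out_losses, q=0.9, delta=0.05))
```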
Recently, large-scale pre-trained models have shown their advantages in many tasks. However, due to their huge computational complexity and storage requirements, it is challenging to apply large-scale models to real scenes. A common solution is knowledge distillation, which regards the large-scale model as a teacher model and helps to train a small student model to obtain competitive performance. Cross-task knowledge distillation expands the application scenarios of the large-scale pre-trained model. Existing knowledge distillation works focus on directly mimicking the final prediction or the intermediate layers of the teacher model, which represent global-level characteristics and are task-specific. To alleviate the constraint of different label spaces, capturing invariant intrinsic local object characteristics (such as the shape characteristics of the legs and tails of cattle and horses) plays a key role. Considering the complexity and variability of real-scene tasks, we propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios. First, to better transfer the generalized knowledge of the teacher model in cross-task scenarios, we propose a prototype learning module that learns from the essential feature representations of objects in the teacher model. Second, for diverse downstream tasks, we propose a task-adaptive feature augmentation module to enhance the student model's features with the learned generalization prototype features and guide the training of the student model to improve its generalization ability. Experimental results on various visual tasks demonstrate the effectiveness of our approach for large-scale model cross-task knowledge distillation scenes.
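As a hypothetical sketch of the prototype idea, and not the ProC-KD implementation, prototypes could be maintained as running means of teacher local features and then used to augment student features; the momentum update, the blending rule, and all dimensions below are our assumptions.

```python
# Hypothetical prototype bank: assign teacher features to their nearest prototype,
# update prototypes as momentum-smoothed means, and blend student features with
# their nearest prototype to inject generalized object-level knowledge.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, num_prototypes=64, dim=256, momentum=0.99):
        self.protos = torch.randn(num_prototypes, dim)
        self.momentum = momentum

    def update(self, teacher_feats):
        # teacher_feats: (n, dim) local object features from the teacher.
        sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(self.protos, dim=-1).T
        assign = sim.argmax(dim=1)                        # nearest prototype per feature
        for p in assign.unique():
            mean_feat = teacher_feats[assign == p].mean(dim=0)
            self.protos[p] = self.momentum * self.protos[p] + (1 - self.momentum) * mean_feat

    def augment(self, student_feats, alpha=0.3):
        # Blend each student feature with its nearest prototype.
        sim = F.normalize(student_feats, dim=-1) @ F.normalize(self.protos, dim=-1).T
        nearest = self.protos[sim.argmax(dim=1)]
        return (1 - alpha) * student_feats + alpha * nearest

bank = PrototypeBank()
bank.update(torch.randn(128, 256))                        # toy teacher features
print(bank.augment(torch.randn(32, 256)).shape)           # torch.Size([32, 256])
```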